Finite Sample Approximation Results for Principal Component Analysis: A Matrix Perturbation Approach

Author

Boaz Nadler
Abstract
Principal component analysis (PCA) is a standard tool for dimensional reduction of a set of n observations (samples), each with p variables. In this paper, using a matrix perturbation approach, we study the non-asymptotic relation between the eigenvalues and eigenvectors of PCA computed on a finite sample of size n and those of the limiting population PCA as n → ∞. Under a spiked covariance model, we present a finite sample theorem which holds with high probability, in the style of bounds common in machine learning, for the closeness between the leading eigenvalue and eigenvector of sample PCA and those of population PCA. In addition, we consider the relation between finite sample PCA and the asymptotic results in the joint limit p, n → ∞ with p/n = c. We present a matrix perturbation view of the "phase transition phenomenon" and a simple linear-algebra-based derivation of the eigenvalue and eigenvector overlap in this asymptotic limit. Moreover, our analysis also applies for finite p, n, where we show that although there is no sharp phase transition as in the infinite case, the eigenvector of sample PCA may exhibit a sharp "loss of tracking," either as a function of the noise level or as a function of the sample size n, suddenly losing its relation to the (true) eigenvector of the population covariance matrix. This occurs due to a crossover between the eigenvalue due to the signal and the largest eigenvalue due to noise, whose eigenvector points in a random direction.
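To make the "loss of tracking" picture concrete, here is a minimal numpy simulation sketch (my own illustration, not code from the paper): it draws n samples from a rank-one spiked covariance model Σ = I_p + (ℓ − 1)vvᵀ and compares the empirical squared overlap between the leading sample and population eigenvectors with the standard asymptotic prediction for p, n → ∞, p/n = c. The overlap formula used below (nonzero only above the threshold ℓ > 1 + √c) is the one commonly attributed to Paul (2007), quoted here as an assumption rather than taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 400, 800                       # dimension and sample size, c = p/n = 0.5
c = p / n

v = np.zeros(p)
v[0] = 1.0                            # population leading eigenvector (axis-aligned for simplicity)
for ell in [1.2, 1.5, 2.0, 4.0]:      # population leading eigenvalue (the "spike")
    # Draw n samples with covariance I_p + (ell - 1) v v^T: for an
    # axis-aligned v this just rescales the first coordinate.
    X = rng.standard_normal((n, p))
    X[:, 0] *= np.sqrt(ell)
    S = X.T @ X / n                   # sample covariance matrix
    _, eigvecs = np.linalg.eigh(S)    # eigenvalues returned in ascending order
    v_hat = eigvecs[:, -1]            # leading sample eigenvector
    overlap = float(v @ v_hat) ** 2
    # Assumed asymptotic squared overlap (p, n -> inf, p/n = c; cf. Paul 2007):
    # nonzero only above the detection threshold ell > 1 + sqrt(c).
    lam = ell - 1.0
    pred = (1 - c / lam**2) / (1 + c / lam) if lam > np.sqrt(c) else 0.0
    print(f"ell = {ell:3.1f}   empirical |<v, v_hat>|^2 = {overlap:.3f}   asymptotic = {pred:.3f}")
```

Running this shows the crossover directly: for spikes below the threshold the empirical overlap hovers near zero (the leading eigenvector points in an essentially random direction), while above it the overlap tracks the asymptotic value.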
Similar Resources
Refined Perturbation Bounds for Eigenvalues of Hermitian and Non-Hermitian Matrices
We present eigenvalue bounds for perturbations of Hermitian matrices, and express the change in eigenvalues in terms of a projection of the perturbation onto a particular eigenspace, rather than in terms of the full perturbation. The perturbations we consider are Hermitian of rank one, and Hermitian or non-Hermitian with norm smaller than the spectral gap of a specific eigenvalue. Applications ...
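As a quick illustration of the classical baseline that such refinements improve upon, the following sketch (my own, not from the paper) checks Weyl's inequality numerically for a Hermitian rank-one perturbation: each eigenvalue of A moves by at most the spectral norm of E. The paper's refined bounds replace ‖E‖₂ by the typically smaller norm of the projection of E onto a relevant eigenspace.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6
B = rng.standard_normal((m, m))
A = (B + B.T) / 2                     # random real symmetric (Hermitian) matrix
u = rng.standard_normal(m)
u /= np.linalg.norm(u)
E = 0.3 * np.outer(u, u)              # Hermitian rank-one perturbation, ||E||_2 = 0.3

lam = np.linalg.eigvalsh(A)           # eigenvalues of A, ascending
lam_pert = np.linalg.eigvalsh(A + E)  # eigenvalues of the perturbed matrix
shift = np.abs(lam_pert - lam)        # movement of each eigenvalue

print("max eigenvalue shift:", shift.max())
print("Weyl bound ||E||_2  :", np.linalg.norm(E, 2))
assert shift.max() <= np.linalg.norm(E, 2) + 1e-12   # Weyl's inequality holds
```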
Discussion (Journal of the American Statistical Association, June 2009)
I commend Johnstone and Lu for publishing this important article, which has motivated quite a lot of recent work on sparsity and statistical inference in high-dimensional settings. In their article, Johnstone and Lu present two main results. First, in the presence of considerable noise in the x variables, with a number of samples n not significantly larger than the number of variables p, the sa...
Normalized Cuts Are Approximately Inverse Exit Times
The Normalized Cut is a widely used measure of separation between clusters in a graph. In this paper we provide a novel probabilistic perspective on this measure. We show that for a partition of a graph into two weakly connected sets, V = A ⊎ B, in fact Ncut(V) = 1/τ_{A→B} + 1/τ_{B→A}, where τ_{A→B} is the uni-directional characteristic exit time of a random walk from subset A to subset B. Using matrix...
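A toy computation of the left-hand side of this identity, under the standard two-way definition Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B) (stated here as an assumption, since the truncated abstract does not restate it): on two triangles joined by a weak edge, the normalized cut is small, consistent with long exit times between the two clusters.

```python
import numpy as np

# Two triangles (nodes 0-2 and 3-5) joined by one weak edge: a partition
# into two weakly connected sets, as in the paper's setting.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1               # weak link between the clusters

A, B = [0, 1, 2], [3, 4, 5]
cut = W[np.ix_(A, B)].sum()           # total edge weight crossing the partition
vol_A = W[A, :].sum()                 # volume of A: sum of its node degrees
vol_B = W[B, :].sum()

ncut = cut / vol_A + cut / vol_B      # standard two-way normalized cut
print(f"cut = {cut:.2f}, vol(A) = {vol_A:.2f}, vol(B) = {vol_B:.2f}, Ncut = {ncut:.4f}")
```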
Roy's largest root under rank-one alternatives: The complex valued case and applications
The largest eigenvalue of a Wishart matrix, known as Roy's largest root (RLR), plays an important role in a variety of applications. Most works to date derived approximations to its distribution under various asymptotic regimes, such as degrees of freedom, dimension, or both tending to infinity. However, several applications involve finite and relatively small parameters, for which the above appr...
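For orientation, here is a direct Monte Carlo sketch of the null distribution of Roy's largest root in the real-valued case (the paper treats the complex-valued analogue); the parameter values below are illustrative, not from the paper. In the finite, small-parameter regime the paper targets, such simulation is a natural reference point for approximation formulas.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 5, 20, 2000              # dimension, degrees of freedom, Monte Carlo trials

largest = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, p))   # n i.i.d. N(0, I_p) observations
    W = X.T @ X                       # white Wishart matrix W_p(n, I)
    largest[r] = np.linalg.eigvalsh(W)[-1]   # Roy's largest root under the null

print("mean largest eigenvalue:", largest.mean())
print("95th percentile        :", np.quantile(largest, 0.95))
```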